ABSTRACT

Expression Atlas (http://www.ebi.ac.uk/gxa) is a 
value-added database providing information about 
gene, protein and splice variant expression in differ- 
ent cell types, organism parts, developmental stages, 
diseases and other biological and experimental con- 
ditions. The database consists of selected high- 
quality microarray and RNA-sequencing experiments 
from ArrayExpress that have been manually curated, 
annotated with Experimental Factor Ontology terms 
and processed using standardized microarray and 
RNA-sequencing analysis methods. The new version 
of Expression Atlas introduces the concept of 
'baseline' expression, i.e. gene and splice variant 
abundance levels in healthy or untreated conditions, 
such as tissues or cell types. Differential gene expres- 
sion data benefit from an in-depth curation of experi- 
mental intent, resulting in biologically meaningful 
'contrasts', i.e. instances of differential pairwise com- 
parisons between two sets of biological replicates. 
Other novel aspects of Expression Atlas are its 
strict quality control of raw experimental data, up- 
to-date RNA-sequencing analysis methods, expres- 
sion data at the level of gene sets, as well as genes 
and a more powerful search interface designed to 
maximize the biological value provided to the user. 

INTRODUCTION 

Expression Atlas is a further development of our previous
version of Gene Expression Atlas (1), launched by the
European Bioinformatics Institute (EBI) in 2008, and continues
its original remit as a value-added database for
querying differential gene expression across tissues, cell
types and cell lines under various biological conditions.
These include developmental stages, physiological states, 
phenotypes and diseases and cover multiple organisms. 
Expression Atlas is developed with a view to accommodating
data from multi-omics experiments, such as proteomics.
High-quality microarray and RNA-sequencing
data in Expression Atlas continue to come from
ArrayExpress (2), including data imported from GEO 
(3). Differential expression is reported for both coding 
and non-coding transcripts. The sample attributes and 
experimental factors (i.e. conditions under study) are 
systematized and mapped to the Experimental Factor 
Ontology [EFO (4)]. 

In particular, Expression Atlas introduces the concept
of baseline expression: the abundance of each gene and
splice variant in healthy or untreated tissues, cell types or
cellular components. Baseline expression is reported
within a species-specific context of selected large RNA-sequencing
experiments and provides a useful reference
for the user when considering differential expression data. 

Expression Atlas continues to analyse and report stat- 
istically robust differential expression for both coding and 
non-coding transcripts. However, the biological relevance 
of these data has been vastly improved due to an in-depth 
manual curation of the experimental intent that for each 
differential experiment yields a set of 'contrasts', i.e. in- 
stances of differential pairwise comparisons between two 
sets of biological replicates: the 'reference' (e.g. 'healthy'
or 'wild type') set and a 'test' set (e.g. 'diseased' or 
'mutant'). Each of these sets is typically described by a 
number of sample attributes and experimental factors. 
For example, all biological replicates treated with a test
compound may be compared with untreated samples.
Statistical analysis is then performed, providing P-values
and (for microarray only) t-statistics, linking each gene to
differential contrasts in each experiment.

Another novel aspect of Expression Atlas is its focus on 
quality control of raw experimental data and of experimental
design. A minimum acceptable number of biological
sample replicates (three) is also enforced to ensure
sufficient statistical power to detect differential expression. 
Before submission into analysis pipelines, all experimental 
raw data undergo quality control. In the case of RNA-sequencing
experiments, poor-quality reads and those
originating from contamination are excluded from 
further analysis. Outlier arrays in microarray experiments 
are also removed before manual contrast identification 
and statistical analysis. 
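
The notion of a contrast and the replicate requirement described above can be pictured with a small data structure. This is only an illustrative sketch: the `Contrast` class, its field names and the `drop_outliers` helper are hypothetical and are not taken from the Expression Atlas code base, and it assumes that the minimum of three replicates applies to each side of the comparison.

```python
from dataclasses import dataclass

MIN_REPLICATES = 3  # minimum number of biological replicates stated in the text


@dataclass(frozen=True)
class Contrast:
    """A pairwise comparison between a reference set and a test set of assays."""
    name: str
    reference: tuple  # assay identifiers, e.g. 'healthy' or 'wild type' samples
    test: tuple       # assay identifiers, e.g. 'diseased' or 'mutant' samples

    def is_valid(self) -> bool:
        # Assumption: both sides need enough replicates,
        # and no assay may appear on both sides.
        enough = min(len(self.reference), len(self.test)) >= MIN_REPLICATES
        disjoint = not set(self.reference) & set(self.test)
        return enough and disjoint


def drop_outliers(contrast: Contrast, outliers: set) -> Contrast:
    """Return a copy of the contrast without assays flagged by quality control."""
    def keep(ids):
        return tuple(a for a in ids if a not in outliers)
    return Contrast(contrast.name, keep(contrast.reference), keep(contrast.test))


c = Contrast("treated vs untreated", ("u1", "u2", "u3"), ("t1", "t2", "t3", "t4"))
print(c.is_valid())                         # True
print(drop_outliers(c, {"u2"}).is_valid())  # False: only two reference assays remain
```

In this sketch, removing an outlier array can push a contrast below the replicate threshold, which is one way to picture why quality control may exclude an experiment.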

The focus on the quality of raw data and experimental 
design has led to exclusion of low-quality experiments. 
The manual curation of contrasts for all eligible experiments
is also on-going, leading to a temporary reduction
in the number of experiments in Expression Atlas. 

Support of reproducible analysis is provided for each
experiment by listing analysis methods and versions used
for processing its raw data, offering links to source code
where possible, as well as showing the version of the Ensembl
genome reference used for mapping (for RNA-sequencing
experiments), and the miRBase (5) release from
which probe-set to microRNA mappings were taken for
microRNA microarray experiments. The user should
thus be able to reproduce the results presented in
Expression Atlas, by analysing the raw experimental
data using the methods listed for that experiment.

The Expression Atlas search interface allows querying
gene, splice variant or protein attributes (including
organism), at the level of individual genes or whole gene
sets. The user can also search for sample attributes and
experimental factors. Both baseline and differential components
of Expression Atlas are queried by default. The
experiments returned are those in which the queried
sample attributes match either the studied healthy or untreated
biological conditions, e.g. tissues or cell types
(baseline expression), or match either a 'test' or a 'reference'
side of a differential contrast (differential expression).
Finally, the set of queried experiments can be
restricted by providing a list of accessions, keywords or
the species of samples studied in them.

The RNA-seq processing pipeline used to generate data 
for Expression Atlas is shown in Figure 1. The full details 
of material and methods used to generate expression data 
shown in the Expression Atlas interface are available in the
Supplementary Material. 
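
The Figure 1 caption describes the final summarization step of this pipeline as a median taken first across technical replicates and then across biological replicates. A minimal sketch of that two-level median follows; the function name, the input layout (a mapping from biological replicate to its technical-replicate values) and the numbers are illustrative assumptions, not the pipeline's actual code or data.

```python
from statistics import median


def summarize(values_by_bio_replicate):
    """Collapse replicate-level expression values for one gene in one condition.

    values_by_bio_replicate maps a biological replicate identifier to the list
    of expression values measured in its technical replicates.
    """
    # Step 1: median across the technical replicates of each biological replicate.
    per_bio = [median(tech) for tech in values_by_bio_replicate.values()]
    # Step 2: median across biological replicates.
    return median(per_bio)


# Hypothetical values for one gene in one tissue.
heart = {"bio1": [10.0, 12.0], "bio2": [8.0], "bio3": [20.0, 22.0, 30.0]}
print(summarize(heart))  # 11.0
```

Taking the medians in this order keeps a biological replicate with many technical replicates from dominating the summary.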

RESULTS 

Data 

As of 24 September 2013, Expression Atlas contains highly
curated data from 214 experiments, including four baseline
RNA-sequencing experiments (nine species) and 210 differential
experiments (13 species). Baseline experiments
include Illumina Body Map (http://www.ebi.ac.uk/gxa/experiments/E-MTAB-513)
and Encode Cell Lines (http://www.ebi.ac.uk/gxa/experiments/E-GEOD-26284).
Differential experiments include 10 RNA-sequencing and 200
microarray experiments, mainly single-channel experiments
performed on gene arrays. Finally, microarray experiments
studying microRNAs are also available (e.g.
http://www.ebi.ac.uk/gxa/experiments/E-TABM-713).

New user interface features 

Expression Atlas offers a separate page for each experi- 
ment, as well as pages presenting baseline and differential 
expression data for each gene, protein, gene set (e.g. 
REACTOME pathway) and experimental condition (e.g. 
'heart') stored in Atlas. 

Baseline expression (Figure 2) 

Users can search a baseline experiment with gene names, 
protein accessions, gene, protein or splice variant identi- 
fiers, keywords, biotypes (e.g. 'protein coding'), GO and 
InterPro terms as well as Reactome pathway IDs. 
Optionally, each term (e.g. REACTOME pathway ID) 
can be interpreted as a gene set, offering the user an 
aggregated expression level across all genes in each 
queried gene set. Users may also search using studied ex- 
perimental conditions (e.g. tissue in Figure 2). By default, 
search results are ordered such that genes that are most 
specifically expressed in the experimental condition(s) of 
interest are at the top. This is implemented by rewarding 
higher expression in the conditions of interest and as low as 
possible expression in the remaining conditions. 
Optionally, the user may wish to search for 'non-specific'
expression: in this scenario genes with high expression in
query conditions are rewarded, and they are not
penalized for high expression in non-query conditions.
This type of query typically returns 'house-keeping' genes 
at the top of the results table, i.e. those with high levels of 
expression in the majority of experimental conditions. 
Expression levels below the displayed FPKM cut-off (0.5 
by default) are treated as background (i.e. 'noise'). The user 
is free to select a different expression level cut-off — a histo- 
gram breaking down the number of genes expressed above 
a given cut-off is included to help the user decide which cut- 
off to use for their query of interest. As FPKMs are already 
a gross approximation of gene expression, the resulting 
matrix encodes the expression level by way of a heatmap, 
though the actual FPKM values can be displayed and are 
downloadable from the experiment page. Finally, clicking 
on a non-empty heatmap cell shows a breakdown of the 
three most abundant splice variants for the corresponding 
gene and experimental condition. 
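
The default and 'non-specific' orderings described above can be sketched as follows. The text does not give the scoring formula, so this sketch assumes one: mean expression in the query conditions as the reward, and mean expression in the remaining conditions as the penalty. The function names and the example values are hypothetical; only the 0.5 FPKM default cut-off comes from the text.

```python
FPKM_CUTOFF = 0.5  # default cut-off quoted in the text


def score(expr, query, specific=True, cutoff=FPKM_CUTOFF):
    """Score one gene; expr maps condition -> FPKM, query is a set of conditions."""
    # Values below the cut-off are treated as background.
    level = {c: (v if v >= cutoff else 0.0) for c, v in expr.items()}
    inside = [v for c, v in level.items() if c in query]
    outside = [v for c, v in level.items() if c not in query]
    reward = sum(inside) / len(inside) if inside else 0.0
    if not specific or not outside:
        return reward  # 'non-specific' search: no penalty term
    return reward - sum(outside) / len(outside)


def rank(genes, query, specific=True, cutoff=FPKM_CUTOFF):
    """Order genes expressed above the cut-off in at least one query condition."""
    expressed = [g for g, expr in genes.items()
                 if any(expr.get(c, 0.0) >= cutoff for c in query)]
    return sorted(expressed,
                  key=lambda g: score(genes[g], query, specific, cutoff),
                  reverse=True)


# Hypothetical FPKM values.
genes = {
    "tissue_gene": {"heart": 50.0, "liver": 0.2, "brain": 1.0},
    "housekeeping": {"heart": 80.0, "liver": 75.0, "brain": 90.0},
    "silent": {"heart": 0.3, "liver": 0.1, "brain": 0.4},
}
print(rank(genes, {"heart"}))                  # ['tissue_gene', 'housekeeping']
print(rank(genes, {"heart"}, specific=False))  # ['housekeeping', 'tissue_gene']
```

With the penalty switched off, the broadly expressed gene moves to the top, which matches the observation that 'non-specific' queries tend to return 'house-keeping' genes first.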

Differential expression (Figure 3)

Users may search a differential experiment by the same
gene properties and keywords as listed earlier in the text
for baseline experiments, additionally selecting the type
of differential expression of interest (up/down by
default). A differential contrast dropdown is also available:
the default search is for differential expression in
any contrast, but the user can also choose one or more
contrasts of interest. By default, the search returns first
genes that are differentially expressed most specifically in
the queried contrast(s). This is achieved by promoting to
the top genes with lowest P-values in the contrast(s) of
interest and at the same time penalizing genes with low
P-values in the remaining contrasts. Optionally, the user
may perform a 'non-specific' search, in which genes with
lowest P-values in the selected contrast(s) come first, irrespective
of whether they are reported with low P-values in
the remaining contrasts. The results of this analysis are
presented to the user in a matrix, with genes (and design
elements, for microarray only) as row labels, and contrasts
as column labels. The results are sorted by
P-value; the t-statistics and log2-fold changes are also
available. As part of the differential analysis, 'MA' plots
are shown for the default FDR of 0.05. The user is able to
choose a different FDR and observe in the resulting
matrix what effect this has had on the results. The differential
experiment page offers downloads of analytics data
as well as raw counts (RNA-sequencing), normalized expression
values (one-colour microarray) and log2-ratios
(two-colour microarray), respectively. Finally, experimental
conditions for each contrast can be viewed via mouse-over
on contrast column headers, in the results matrix,
and on the experiment design page, available via a
button in the top-right corner of the experiment page.

Figure 1. The RNA-seq processing pipeline used to generate data for Expression Atlas. The experimental metadata is retrieved from ArrayExpress.
The raw FASTQ files, retrieved from the European Nucleotide Archive, undergo a quality control procedure via the FASTQC package to remove low-quality
reads and uncalled bases. Subsequently, contaminated reads (e.g. bacterial in the cases of vertebrate samples) are removed. TopHat 1 is used
for mapping the reads to the reference genome, Cufflinks 1 quantifies baseline expression for genes and transcripts and HTSeq quantifies expression
used for subsequent differential expression analysis with DESeq. The final (summarized) baseline expression count for a gene in a condition is a
median taken first across technical replicates, then across biological replicates corresponding to that condition.

Figure 2. Example baseline expression experiment page, with help annotations: Illumina Body Map. (For further information see:
http://www.ebi.ac.uk/gxa/help/baseline-atlas.html).

Figure 3. Example differential expression page, with help annotations: Transcription profiling by array of Drosophila melanogaster CDK8 and Cyclin
C homozygous mutants, determined using 'Affymetrix GeneChip Drosophila Genome 2.0 Array'. (For further information see:
http://www.ebi.ac.uk/gxa/help/differential-atlas.html). Genes that were called as differentially expressed at FDR < 0.05 are shown in red in the MA plot.
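
The ordering of differential search results described in this section can be sketched in the same spirit as the baseline ranking. The scoring formula is not given in the text, so the sketch assumes one: the mean of -log10(P) in the queried contrasts as the reward, minus the same quantity over the remaining contrasts as the penalty. It also assumes that the P-values supplied have already been adjusted for multiple testing, so that they can be compared directly with the FDR threshold. All names and values are hypothetical.

```python
import math

DEFAULT_FDR = 0.05  # default threshold quoted in the text


def significance(p):
    """Turn an adjusted P-value into a score that grows as P shrinks."""
    return -math.log10(max(p, 1e-300))  # guard against log10(0)


def contrast_score(pvals, query, specific=True):
    """Score one gene; pvals maps contrast -> adjusted P-value."""
    inside = [significance(p) for c, p in pvals.items() if c in query]
    outside = [significance(p) for c, p in pvals.items() if c not in query]
    reward = sum(inside) / len(inside) if inside else 0.0
    if not specific or not outside:
        return reward  # 'non-specific' search: other contrasts are ignored
    return reward - sum(outside) / len(outside)


def rank_differential(genes, query, specific=True, fdr=DEFAULT_FDR):
    """Keep genes called differential in a queried contrast, best score first."""
    called = [g for g, pvals in genes.items()
              if any(pvals.get(c, 1.0) < fdr for c in query)]
    return sorted(called,
                  key=lambda g: contrast_score(genes[g], query, specific),
                  reverse=True)


# Hypothetical adjusted P-values for two contrasts, c1 and c2.
genes = {
    "geneA": {"c1": 1e-6, "c2": 0.8},
    "geneB": {"c1": 1e-8, "c2": 1e-7},
    "geneC": {"c1": 0.3, "c2": 0.9},
}
print(rank_differential(genes, {"c1"}))                  # ['geneA', 'geneB']
print(rank_differential(genes, {"c1"}, specific=False))  # ['geneB', 'geneA']
```

Here geneB has the lowest P-value in the queried contrast but is also strongly differential elsewhere, so it ranks first only in the 'non-specific' search; geneC is dropped because it does not pass the FDR threshold.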

Gene/protein/gene-set page

For each gene, protein and gene set (e.g. Reactome 
pathway ID), Expression Atlas provides a summary page 
that contains, at most, three separate panes (Figure 4). The 
top pane contains extensive annotation for the represented
bio-entity, including links to external resources, its
orthologues and so forth. The middle pane shows 
baseline expression information from the representative 
baseline experiment in which the bio-entity was studied. 
For gene sets, the aggregated expression levels across all 
genes in the set are shown for each experimental condition. 
Finally, the bottom pane (Figure 5) shows differential expression,
sorted by P-value, across all contrasts in experiments
available in Expression Atlas. Mouse-over on a
contrast description shows experimental conditions 
describing the test and the reference sides of that contrast 
(shown in Figure 5 for the top contrast); clicking on a 
contrast takes the user to the page of the experiment 
from which the analytics were retrieved. 

Experiment list page 

This page (http://www.ebi.ac.uk/gxa/experiments) 
presents a sortable and searchable list of all experiments
currently loaded in Expression Atlas, documenting, 
among other things, experiment type (baseline or differen- 
tial), the number of assays analysed for that experiment, 
the organisms and experimental conditions studied, the 
number of contrasts identified (differential experiments 
only) and the array designs used in the experiment (micro- 
array only). 

Atlas infrastructure developments

Software availability

The Expression Atlas software is designed to run in-house 
only. However, the software source code can be accessed 
via http://github.com/gxa/atlas. 

Figure 4. Baseline expression on summary page example for human BRCA1 gene: http://www.ebi.ac.uk/gxa/genes/ENSG00000012048.



Release process 

Gene Expression Atlas, described in our previous update, 
released its data and software on a monthly basis. 
Expression Atlas will also release data regularly, providing 
individual experimental data and tar-gzip snapshots of all 
the data (for ease of download) on the EMBL-EBI FTP 
server (ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/ 
atlas). The web services software will be released regularly, 
with appropriate release notes notifying users of function- 
ality changes. 



FUTURE DIRECTIONS 

Protein expression 

Expression Atlas is intended as a multiomics, and in par- 
ticular as a functional genomics and proteomics, resource, 
incorporating expression of not only genes but also splice
variants and proteins. Although methods for the quantitation
and statistical analysis of gene expression are relatively
mature and well established, the equivalent methods for
protein detection, quantification and statistical analysis
are still active areas of research. Consequently, in the
first instance, we will include protein expression data as
additional information to the transcriptomics data in the
baseline component of Expression Atlas only. EFO will be
used to identify data sets with corresponding sample descriptions
in the PRIDE database (6). Expression of each
protein in those sets will be shown within the context of 
the baseline expression of the particular gene coding for 
the protein in the corresponding experimental condition. 
Appropriate provenance will be attributed to each source 
of protein expression data within the Expression Atlas
interface.

Baseline expression data improvements 

We plan to increase our baseline expression coverage to 
experiments in novel species, containing greater resolution 
of studied factors, e.g. tissues, as well as with greater bio- 
logical replication of studied samples — in aid of more 
robust analysis results presented to the user. The 
baseline expression analysis will also include data sets
that study heterogeneity among individuals and, for 
example, tissues, focusing on variation data, expression 
quantitative trait loci and mutations. 

Expression visualization improvements 

We will make transcript expression levels more prominent
in our experiment pages, focusing on genome browser
coverage views of expression, allowing the user to
observe in detail how expression is distributed across different
exons and transcripts of a given gene.

Figure 5. Differential expression on summary page for human BRCA1 gene: http://www.ebi.ac.uk/gxa/genes/ENSG00000012048.

Baseline expression aggregation 

We are working on methods to aggregate the expression of a
gene, in a given experimental condition, across all applicable
RNA-sequencing experiments, so that a single expression
level for that 'gene-experimental condition'
combination can be shown to the user.

Gene set enrichment analysis 

Currently, only baseline expression summaries for gene
sets are offered in the Expression Atlas interface. Pre-computed
gene set enrichment analysis results in the
context of differential expression will be offered, for
example for InterPro terms, GO terms and REACTOME
pathways. The results of this analysis will be shown in
the corresponding gene set summary page. Users will 
also be able to submit an arbitrary set of genes to 
quantify enrichment against all contrasts/differential 
gene sets present in Expression Atlas. Such queries may 
be submitted together with experimental conditions to 
restrict the set of contrasts to analyse the enrichment in. 
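
The text does not say which enrichment statistic will be used. As an illustration only, one common choice for this kind of query is a hypergeometric over-representation test, sketched below; the function names and the toy gene identifiers are hypothetical, and the planned Expression Atlas feature may work differently.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the gene set."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enrichment_p(submitted, differential, background):
    """Over-representation of a contrast's differential genes in a submitted set."""
    background = set(background)
    submitted = set(submitted) & background
    differential = set(differential) & background
    overlap = len(submitted & differential)
    return hypergeom_tail(overlap, len(differential), len(submitted), len(background))


# Toy example: 20 background genes, 5 differential in a contrast,
# 4 submitted by the user, 2 of them differential.
background = {f"g{i}" for i in range(20)}
differential = {"g0", "g1", "g2", "g3", "g4"}
submitted = {"g0", "g1", "g10", "g11"}
p = enrichment_p(submitted, differential, background)
print(round(p, 4))
```

A small P-value would indicate that the submitted genes overlap the contrast's differential genes more than expected by chance; in practice such P-values would also need correcting across all contrasts tested.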



Hom(e)ologue expression 

The Expression Atlas interface will facilitate gene co-ex- 
pression analysis, including paralogues and homeologues 
where applicable, as well as comparative analysis of 
expression of orthologues. 

MicroRNA RNA-sequencing experiments 

The pipeline used to process RNA-sequencing data for 
Expression Atlas will be enhanced to analyse microRNA 
RNA-sequencing experiments. Subsequently, good-quality
microRNA RNA-sequencing experiments available in 
ArrayExpress will be re-processed and included in 
Expression Atlas. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 